{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LlamaParse Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index llama-parse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-02-02 11:10:10--  https://arxiv.org/pdf/1706.03762.pdf\n",
      "Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.3.42, 151.101.67.42, ...\n",
      "Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2215244 (2.1M) [application/pdf]\n",
      "Saving to: ‘./attention.pdf’\n",
      "\n",
      "./attention.pdf     100%[===================>]   2.11M  --.-KB/s    in 0.08s   \n",
      "\n",
      "2024-02-02 11:10:10 (25.9 MB/s) - ‘./attention.pdf’ saved [2215244/2215244]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget \"https://arxiv.org/pdf/1706.03762.pdf\" -O \"./attention.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# llama-parse is async-first, running the sync code in a notebook requires the use of nest_asyncio\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "import os\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id dd0b8e31-0c09-4497-b78a-cc1c92f1d6cf\n"
     ]
    }
   ],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "documents = LlamaParse(result_type=\"text\").load_data(\"./attention.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ad\n",
      "relying entirely on an attention mechanism to draw global dependencies between input and output.\n",
      "The Transformer allows for significantly more parallelization and can reach a new state of the art in\n",
      "translation quality after being trained for as little as twelve hours on eight P100 GPUs.\n",
      "2    Background\n",
      "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n",
      "[16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building\n",
      "block, computing hidden representations in parallel for all input and output positions. In these models,\n",
      "the number of operations required to relate signals from two arbitrary input or output positions grows\n",
      "in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\n",
      "it more difficult to learn dependencies between distant positions [12]. In the Transformer this is\n",
      "reduced to a constant number of operations, albeit at the cost of reduced effective res\n"
     ]
    }
   ],
   "source": [
    "print(documents[0].text[6000:7000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id d4531453-1bbb-48c4-8324-ae9fea9f2fa2\n"
     ]
    }
   ],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "documents = LlamaParse(result_type=\"markdown\").load_data(\"./attention.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ction describes the training regime for our models.\n",
      "\n",
      "##### Training Data and Batching\n",
      "\n",
      "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million\n",
      "sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-\n",
      "target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT\n",
      "2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece\n",
      "vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training\n",
      "batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000\n",
      "target tokens.\n",
      "\n",
      "##### Hardware and Schedule\n",
      "\n",
      "We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using\n",
      "the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We\n",
      "trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the\n",
      "bo...\n"
     ]
    }
   ],
   "source": [
    "print(documents[0].text[20000:21000] + \"...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
